Learning discriminative trajectorylet detector sets
for accurate skeleton-based action recognition 

Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton von den Hengel 


Abstract —The introduction of low-cost RGB-D sensors has 
promoted research in skeleton-based human action recognition.
Devising a representation suitable for characterising actions
on the basis of noisy skeleton sequences remains a challenge,
however. We here provide two insights into this challenge.
First, we show that the discriminative information of a skeleton
sequence usually resides in a short temporal interval and we
propose a simple but effective local descriptor called the
trajectorylet to capture the static and kinematic information within
this interval. Second, we further propose to encode each trajectorylet
with a discriminative trajectorylet detector set which is selected
from a large number of candidate detectors trained through
exemplar-SVMs. The action-level representation is obtained by
pooling trajectorylet encodings. Evaluated on standard datasets
acquired from the Kinect sensor, our method obtains superior
results over existing approaches under various experimental
setups.

Index Terms —Action recognition, Kinect sensor, exemplar 
support vector machines, feature learning, 3D action feature 
representation. 


I. Introduction 

THE recognition of human actions has been an active research
field in recent years and much effort has been made to
address this problem [21]. Intuitively, a temporal sequence of
3D skeleton joint locations captures sufficient information to
distinguish between actions, but recording skeleton sequences
was very expensive with traditional motion capture technology,
which limited the applications to which it has been
applied [13]. Recently, with the advent of RGB-D cameras
such as the Microsoft Kinect, the acquisition of 3D skeleton
data for action recognition has become much easier and
faster [20]. This advance has prompted a number of skeleton-based
action recognition approaches [5], [22], [27]. The key
challenge for these approaches is how to extract discriminative
features from noisy, temporally evolving skeletons.

The trajectory of skeletal joints in space-time is the direct 
representation of human actions. Earlier works oi. m 
model human action trajectory descriptors of variable-lengths 
and classify them based on the similarity matching of trajectories.
In [1, an action representation is encoded with a
histogram voted by the displacements of joint trajectories with
respect to their orientations. In these works, the global feature
is extracted from the whole trajectory. However, only a short

Fig. 1 - Skeleton sequences from two action classes (two hand wave and
hand clap). Only the red skeletons show significant differences between
the two sequences. In this example, fewer than 20% of the frames are
required to tell whether the skeleton is clapping or waving.


section of the trajectory is actually distinctive and can provide
usable information about the action being undertaken. For
example, as illustrated in Figure 1, only the moments when the
subject moves their hands, during the two actions of waving and
clapping, are indicative of the performed action class, while all
the remaining poses are irrelevant and potentially distracting.
The abundant non-informative local patterns may cause large
variance in the global trajectory. In contrast to the global
representation, later works [30], [31] explore discriminative
patterns to create local descriptors at the frame level. Despite
its robustness, a frame-level descriptor without additional
temporal information hardly depicts the movement of actions,
and is insufficient for recognition.

Unlike the above-mentioned approaches, which either
represent an action with the whole sequence or extract local
features at the frame level, we argue that the discriminative
information of an action is better captured by a short interval
of trajectories. This interval usually consists of several frames.
In other words, its temporal range is longer than a single
frame but much shorter than the whole skeleton sequence.
To extract features from the trajectory interval, we make
our first contribution by designing a novel local descriptor
called the trajectorylet to capture the static and dynamic motion
information within the short interval.

Furthermore, as we have observed, not all trajectorylets
in a sequence are equally important for classification and
the recognition performance generally benefits from focusing
on the discriminative ones. In skeleton-based action recognition,
recent works [3], [31] directly learn the discriminative
frames from the training set. Unlike these works,
our approach does not explicitly look for the discriminative
trajectorylets, but rather provides a method for creating a
set of detectors that fire on specific template trajectorylets.
Our approach first applies exemplar-SVM [10] to learn a
large number of candidate detectors and then selects detectors
according to their discriminative performance over the
trajectorylets in the training set. We further group the detectors into
multiple clusters, and remove the redundancy of the learned
detectors by selecting one representative detector from each
cluster. The selected detectors form a template detector set and
their detection scores on a trajectorylet are used as the coding
vector of that trajectorylet. The action-level representation is
then obtained by pooling all trajectorylet coding vectors, and
temporal pyramid pooling can also be incorporated to capture
the long-range temporal information of the action sequence.
In extensive experiments, this framework brings a significant
performance improvement over state-of-the-art approaches for
skeleton-based action recognition.

In summary, our first contribution is the trajectorylet, a
novel local descriptor that captures static and dynamic information
in a short interval of joint trajectories. Our second
contribution is a novel framework that generates a robust
and discriminative representation for action instances from a
set of learned template trajectorylet detectors.

After briefly reviewing related literature in Section II,
we present the design of our local feature and detector
learning method in Section III. We then present the action-level
representation of an action instance in Section IV.
Our framework is experimentally evaluated in Section V and
summarized in Section VI.


II. Related work 


The key challenge of skeleton-based action recognition is
how to construct the action representation from a sequence
of skeletal joints. Some video-based methods 0, [24] extract
trajectories of multiple tracking points, and compute
descriptors along them, such as HOG, HOF and MBH. For
skeleton-based methods, trajectories are directly obtained from
the space-time evolution of skeletal joints. The most straightforward
way is to model the trajectory holistically, either by
extracting statistics from the sequence or by modelling its
generative process. In Q, a histogram records the displacements
of joint orientations over the whole trajectory. In [16], the
action is modelled with the pairwise affinities of joint angle
trajectories. In [29], the action sequence is modelled by a
Hidden Markov Model with a quantized histogram of spherical
coordinates of joint locations as the frame-level feature. In [22],
3D geometric relationships between various body parts are
modelled with a Lie group to represent the whole action.

Besides directly modelling the trajectory holistically, it has
also been noted that only a small fraction of the patterns in a skeletal
sequence are actually distinctive, and thus many approaches
have been proposed to identify those discriminative patterns,
whether they are defined spatially or temporally.

It has been found that not all skeletal joints are informative
for distinguishing one action from the others, so it is
beneficial to select a subset of joints. Ofli et al. [15] select a
subset of the most informative joints according to criteria such as
the mean or variance of joint angles. In [25], joints are grouped
into actionlets, and the most discriminative collection of them
is mined via a multiple kernel learning approach. In [2],
a subset of joints within a short time interval is extracted
according to the spatio-temporal hierarchy of the moving
skeleton, and a linear combination of them is learned via a
discriminative metric learning approach. In [23], the distinctive
set of body parts is mined from their co-occurring spatial
and temporal configurations. In [1], an evolutionary algorithm
is employed to select an optimal subset of joints for action
representation, and classification is performed using DTW-based
sequence matching.

As most of the frames in an action sequence consist
of non-distinctive static poses, features at a few discriminative
temporal locations are informative enough to represent an
action. In video-based action recognition, a number of key
frame selection approaches have been proposed. In [32], key
frames are selected by ranking the conditional entropy of the
codewords assigned to the frames. In [18], the locations of key
frames are modelled as latent variables and estimated for each
action instance by dynamic programming. In recent works on
skeleton-based action recognition, distinctive canonical poses
[3] are learned via logistic regression, and discriminative
frames [31] are identified by their approximated confidence
of belonging to a specific action class. In [30], the distinctiveness
of each frame is calculated by a measurement of accumulated
motion energy.


III. The proposed action representation 

Our model utilizes the relationships between the positions
of the $J$ skeletal joints $\mathbf{j}_j = (x_j, y_j, z_j) \in \mathbb{R}^3$, $j = 1, \dots, J$, in
the current and preceding frames to form a local trajectorylet.
Because human skeleton size varies across action
instances, we perform a skeleton size normalization on the raw
skeletal joints according to [31]. We also subtract the position
of the hip center $\mathbf{j}_{hip}$ from each joint and concatenate the results to
form a feature column $\mathbf{j} = [\mathbf{j}_1 - \mathbf{j}_{hip}, \dots, \mathbf{j}_J - \mathbf{j}_{hip}] \in \mathbb{R}^{3J}$,
making the hip center the origin of the coordinate system across
all frames and subjects.
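
As an illustration of the hip-centering step only, the following minimal sketch flattens one frame of joint positions into the $3J$-dimensional feature column. The function name, the choice of index 0 for the hip center, and the toy coordinates are assumptions made for the example; the skeleton size normalization is not reproduced here.

```python
import numpy as np

def hip_centered_feature(joints, hip_index=0):
    """Flatten one frame of J joints (J x 3) into a 3J vector relative to the hip.

    hip_index is a hypothetical convention: the row that holds the hip center.
    """
    joints = np.asarray(joints, dtype=float)
    centered = joints - joints[hip_index]   # subtract the hip-center position from every joint
    return centered.reshape(-1)             # [j_1 - j_hip, ..., j_J - j_hip] in R^{3J}

# Toy frame with J = 3 joints; the first row plays the role of the hip center.
frame = np.array([[1.0, 2.0, 3.0],
                  [2.0, 2.0, 3.0],
                  [1.0, 4.0, 2.0]])
feature = hip_centered_feature(frame)
```

After this step the hip joint itself always maps to the zero vector, so every frame shares the same origin.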


A. Trajectorylet 

Although holistic trajectories of joints depict the movement
of the human body, distinctive patterns are usually overwhelmed
by common ones. For example, in long-term actions such as
draw circle and draw tick, only the last moment of the drawing
movement distinguishes them, before which both trajectories
share the same movement of raising the hand for a long
time. On the other hand, as depicted in Figure 2, frame-level
local descriptors record current poses and some local
dynamics, but they fail to capture movement that spans
a long temporal range. To distinguish walk from run, for
instance, we need to examine the displacement and speed of
the joints within a sufficient period of time, rather than the
static poses. Based on these observations, we propose our
trajectorylet local descriptor, which captures the static and
dynamic information of trajectories in a short period of time.
Compared with frame-level descriptors, a trajectorylet depicts

Fig. 2 (panels: Skeleton, Trajectorylet) - The joint coordinate information
at the frame level may provide little information to distinguish between
some action classes, such as the above drawing actions. One of the
advantages of trajectorylets is their ability to focus on the dynamics of
distinctive sections of individual actions.


richer dynamic information. On the other hand, its temporal 
range is much smaller than the whole trajectory sequence and 
therefore it is less affected by potentially irrelevant frames. 

More specifically, considering a trajectorylet of length $L$
starting from frame $t_0$, we extract the static positions of the
joints from each frame occurring before time $t_0 + L$:

$$\mathbf{x}_0^{t_0} = [(\mathbf{j}^{t_0})^T, (\mathbf{j}^{t_0+1})^T, \dots, (\mathbf{j}^{t_0+L-1})^T]^T \in \mathbb{R}^{L \times 3J}. \tag{1}$$

In order to retrieve the dynamic information within this
interval, we inspect multiple levels of temporal dynamics such
as displacement and velocity:

$$\mathbf{x}_1^{t_0} = [(\Delta\mathbf{j}^{t_0+1})^T, \dots, (\Delta\mathbf{j}^{t_0+L-1})^T]^T, \quad \Delta\mathbf{j}^{t_0+i} = \mathbf{j}^{t_0+i} - \mathbf{j}^{t_0}, \quad i = 1, \dots, L-1, \tag{2}$$

$$\mathbf{x}_2^{t_0} = [(\Delta^2\mathbf{j}^{t_0+2})^T, \dots, (\Delta^2\mathbf{j}^{t_0+L-1})^T]^T, \quad \Delta^2\mathbf{j}^{t_0+i} = \Delta\mathbf{j}^{t_0+i} - \Delta\mathbf{j}^{t_0+i-1}, \quad i = 2, \dots, L-1, \tag{3}$$

where $\Delta\mathbf{j}^{t_0+i}$ indicates the relative joint displacements of
frame $t_0 + i$ from the first frame, and $\Delta^2\mathbf{j}^{t_0+i}$ indicates the joint
velocities of frame $t_0 + i$ relative to its previous frame within
the trajectorylet. The static positions in $\mathbf{x}_0$ store the absolute
spatial location of the trajectorylet. The temporal dynamics
$\mathbf{x}_1$ and $\mathbf{x}_2$ approximate the relative kinematic evolution
within this short time interval. Combining both static and
dynamic information, we define the $t$-th trajectorylet for an
action instance with $F$ frames as

$$\mathbf{x}^{(t)} = ((\mathbf{x}_0^{t})^T, (\mathbf{x}_1^{t})^T, (\mathbf{x}_2^{t})^T)^T \in \mathbb{R}^{(3L-3)3J}, \tag{4}$$

where $t = 1, \dots, F - L$.
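
To make the three components concrete, here is a minimal sketch of how one trajectorylet could be assembled from a sequence of per-frame feature vectors. The function name and the array layout (one row per frame) are assumptions for illustration; the sketch simply stacks the $L$ static positions, the $L-1$ displacements from the first frame, and the $L-2$ frame-to-frame velocities, which gives $(3L-3) \cdot 3J$ values in total.

```python
import numpy as np

def trajectorylet(frames, t0, L):
    """Build one trajectorylet from frames[t0 : t0 + L].

    frames: array of shape (F, 3J), one hip-centered feature vector per frame.
    Returns a vector of length (3L - 3) * 3J.
    """
    window = np.asarray(frames, dtype=float)[t0:t0 + L]
    if window.shape[0] != L:
        raise ValueError("window runs past the end of the sequence")
    static = window                          # x_0: L positions
    displacement = window[1:] - window[0]    # x_1: L - 1 offsets from the first frame
    velocity = window[2:] - window[1:-1]     # x_2: L - 2 differences between consecutive frames
    return np.concatenate([static.ravel(), displacement.ravel(), velocity.ravel()])

# Toy sequence: F = 10 frames, J = 2 joints (6 values per frame), L = 5.
frames = np.arange(60, dtype=float).reshape(10, 6)
x = trajectorylet(frames, t0=0, L=5)
```

With $L = 5$ and $J = 20$ (the MSR skeleton layout), the same construction would give a $(3 \cdot 5 - 3) \cdot 60 = 720$-dimensional descriptor before PCA.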

PCA is applied to the trajectorylets to reduce their dimension
for our detector learning module. We still denote the final


Fig. 3 - Visualization of trajectorylets of length 5 at a single joint (left
hand). The red point is the position at the starting frame, and the green
points are its positions at succeeding frames in this interval. The yellow
segments are joint displacements from the first frame. The black segments
are joint velocities at each frame. The top trajectorylet is part of drawing
circle and the bottom trajectorylet is part of high waving. The differences
between them are clearly distinguished by their positions, displacements
and velocities over a short period of time.

descriptor as $\mathbf{x}^{(t)} \in \mathbb{R}^d$, $d < (3L - 3)3J$. Figure 3 visualizes
the components of a trajectorylet, including one static component
and two dynamic components.
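
The dimensionality reduction can be sketched as follows, assuming a plain SVD-based PCA fitted on training trajectorylets only. The helper names and the `keep_ratio` parameter are hypothetical; the default of 0.5 mirrors the 50% reduction reported in the implementation details of Section V.

```python
import numpy as np

def fit_pca(train, keep_ratio=0.5):
    """Learn a PCA projection from training trajectorylets (one per row)."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    d = max(1, int(round(keep_ratio * train.shape[1])))
    d = min(d, vt.shape[0])                  # cannot keep more directions than the SVD returns
    return mean, vt[:d]

def apply_pca(x, mean, components):
    """Project descriptors onto the learned principal directions."""
    return (np.asarray(x, dtype=float) - mean) @ components.T

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 12))           # 200 toy descriptors of dimension 12
mean, components = fit_pca(train)
reduced = apply_pca(train, mean, components)
```

Test descriptors would be projected with the same `mean` and `components`, so no statistics of the test data enter the projection.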

B. Learning candidate detectors of discriminative trajectorylets
using ESVM

As previously discussed, only a small fraction
of the trajectorylets from a sequence contain sufficient
information for identifying the associated action. Most of the
trajectorylets, especially those that contain static postures,
are shared by multiple action classes. Our aim is to learn a set
of detectors that fire on the distinctive trajectorylets. To this
end, we first resort to exemplar-SVM (ESVM) [10] to learn
a large set of detectors, one for each of a large number of
sampled trajectorylets. Then for each action
instance we select a few discriminative trajectorylet detectors
as the candidate detectors of discriminative trajectorylets.

An ESVM learns a decision boundary that achieves the largest
margin between an exemplar sample and a set of negative
examples. If we take each trajectorylet as a positive exemplar
$\mathbf{x}_E$ of its associated class $c$, $c = 1, \dots, C$, and trajectorylets
that belong to other action classes as the negative examples,
we can train an exemplar-SVM for it, which is
formulated as

$$\arg\min_{\mathbf{w}_E, b_E} \|\mathbf{w}_E\|^2 + \lambda_1 h(\mathbf{w}_E^T \mathbf{x}_E + b_E) + \lambda_2 \sum_{\mathbf{x} \in \mathcal{N}_c} h(-\mathbf{w}_E^T \mathbf{x} - b_E), \tag{5}$$

where $h(x) = \max(0, 1 - x)$ is the hinge loss function,
and $\mathcal{N}_c$ is the negative set of trajectorylets that do not belong
to class $c$. $\lambda_1$ and $\lambda_2$ denote the weights of the loss for positive
and negative samples respectively, and $\lambda_1 > \lambda_2$ ensures that
a greater penalty is applied to an incorrectly classified
positive exemplar.
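
For intuition, the ESVM objective can be minimised with a few lines of subgradient descent. This is only a sketch under stated assumptions: the experiments use liblinear rather than this solver, and the function name, learning rate, epoch count and toy data are all hypothetical. The default weights follow the values $\lambda_1 = 10$ and $\lambda_2 = 0.01$ reported in Section V.

```python
import numpy as np

def train_esvm(x_pos, negatives, lam1=10.0, lam2=0.01, lr=0.01, epochs=500):
    """Minimise ||w||^2 + lam1*h(w.x_pos + b) + lam2*sum_x h(-w.x - b),
    with h(z) = max(0, 1 - z), by plain subgradient descent."""
    x_pos = np.asarray(x_pos, dtype=float)
    negatives = np.asarray(negatives, dtype=float)
    w = np.zeros_like(x_pos)
    b = 0.0
    for _ in range(epochs):
        grad_w = 2.0 * w                          # gradient of the regulariser
        grad_b = 0.0
        if w @ x_pos + b < 1.0:                   # exemplar violates its margin
            grad_w -= lam1 * x_pos
            grad_b -= lam1
        active = negatives @ w + b > -1.0         # negatives violating their margin
        grad_w += lam2 * negatives[active].sum(axis=0)
        grad_b += lam2 * active.sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# One exemplar against three negatives in a toy 2-D descriptor space.
x_pos = np.array([2.0, 0.0])
negatives = np.array([[-2.0, 0.5], [-2.0, -0.5], [-1.5, 0.0]])
w, b = train_esvm(x_pos, negatives)
```

Because $\lambda_1$ is much larger than $\lambda_2$, the single positive exemplar dominates the loss unless the negative set is large, which is the regime described in the experiments (thousands of negatives per exemplar).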

For each ESVM, the trained detector $f(\mathbf{x}) = \mathbf{w}_E^T \mathbf{x} + b_E$
returns higher scores on trajectorylets that are most similar
to $\mathbf{x}_E$. If the current exemplar trajectorylet is common to
multiple action classes, the returned trajectorylets are spread
across multiple classes. Conversely, if the current exemplar
trajectorylet is unique to a single class, most returned
trajectorylets belong to the same class as the current exemplar
trajectorylet. Thus we can employ the distribution of action
classes of the returned trajectorylets to estimate the
discriminative power of a detector.

Given an action instance $A$, we extract $F_A$ trajectorylet
descriptors $\mathbf{x}^{(t)}$, $t = 1, \dots, F_A$, and train the associated detectors
$(\mathbf{w}_E^{(t)}, b_E^{(t)})$, $t = 1, \dots, F_A$. A selection method is implemented
to find the most discriminative trajectorylet detectors among
the candidates. More specifically, we apply each detector
$(\mathbf{w}_E^{(t)}, b_E^{(t)})$ to the trajectorylets $\mathbf{x}^{(i)}$, $i = 1, \dots, N$, sampled
from the whole training set and compute the detection scores
$r_{ti} = \mathbf{w}_E^{(t)T} \mathbf{x}^{(i)} + b_E^{(t)}$.¹ From $\mathcal{R}_t = \{r_{ti}\}_{i=1,\dots,N}$ we choose
a subset $\mathcal{R}'_t$ with the top $N_A$ scores, corresponding to the
trajectorylets that are most compatible with the current detector
$(\mathbf{w}_E^{(t)}, b_E^{(t)})$. For the $N_A$ trajectorylets detected by $(\mathbf{w}_E^{(t)}, b_E^{(t)})$,
we denote by $h_t^{(c')}$ the number of trajectorylets belonging to
action class $c'$. The histogram $H_t = [h_t^{(1)}, \dots, h_t^{(C)}] \in \mathbb{R}^C$
gives a clear view of the distinctiveness of detector $(\mathbf{w}_E^{(t)}, b_E^{(t)})$.

If $H_t$ is flat across many classes, $\mathbf{x}^{(t)}$ is a common pattern
shared by many classes and its detector is therefore not
distinctive. If $H_t$ is concentrated at the correct class,
trajectorylet $\mathbf{x}^{(t)}$ is a distinctive pattern for this class and hence
$(\mathbf{w}_E^{(t)}, b_E^{(t)})$ is an effective detector of this distinctive pattern.
In practice, if the correct class corresponding to $(\mathbf{w}_E^{(t)}, b_E^{(t)})$ is
$c$, we denote by $P_t = h_t^{(c)} / N_A$ the ratio of correctly detected
trajectorylets, and detectors with higher $P_t$ are selected because
they fire primarily on trajectorylets of their own class,
which verifies their distinctiveness. We summarize
this approach in Algorithm 1.
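
The histogram $H_t$ and the ratio $P_t$ can be computed as in the following minimal sketch. The function name, the 0-based integer class labels and the toy data are assumptions for illustration; the unit-norm rescaling of the detector parameters is omitted because it does not change the ranking produced by a single detector.

```python
import numpy as np

def detector_purity(w, b, samples, labels, own_class, n_top):
    """Return (H_t, P_t) for one detector from its n_top highest-scoring samples.

    labels are assumed to be integers 0 .. C-1 (a hypothetical convention).
    """
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(labels)
    scores = samples @ np.asarray(w, dtype=float) + b   # r_ti for every sampled trajectorylet
    top = np.argsort(-scores)[:n_top]                   # indices of the n_top largest scores
    hist = np.bincount(labels[top], minlength=int(labels.max()) + 1)
    return hist, hist[own_class] / n_top

# Five 1-D toy samples from two classes; the detector belongs to class 0.
samples = np.array([[3.0], [2.0], [1.0], [-1.0], [-2.0]])
labels = np.array([0, 0, 1, 1, 1])
hist, purity = detector_purity([1.0], 0.0, samples, labels, own_class=0, n_top=3)
```

In this toy case two of the three top-scoring samples share the detector's class, so the ratio is 2/3.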


Algorithm 1 Find discriminative detectors for an action instance

Input: Training action instance $A$ of class $c$; trajectorylets within
it $\{\mathbf{x}^{(t)}\}_{t=1,\dots,F_A}$; sampled training trajectorylets $X = \{\mathbf{x}^{(j)}\}_{j=1,\dots,N}$;
number of trajectorylets to retain $N_A$; maximum
number of detectors to be selected for the instance $M_A$.
Initialize: Set of discriminative detectors for instance $A$: $\mathcal{D}_A = \emptyset$;
number of discriminative detectors selected for the instance $m_A = 0$.
for $t = 1, \dots, F_A$ do
    Solve the ESVM to obtain $(\mathbf{w}_E^{(t)}, b_E^{(t)})$.
    Compute detection scores on the sampled trajectorylet set $X$.
    Compute $H_t$ from the top $N_A$ scored samples.
    Compute the ratio of correctness $P_t$ from $H_t$.
end
Sort $P_t$ in descending order, storing the resulting (sorted) indices in $s$.
for $t$ in $s$ do
    $\mathcal{D}_A = \mathcal{D}_A \cup \{(\mathbf{w}_E^{(t)}, b_E^{(t)})\}$.
    $m_A = m_A + 1$.
    if $m_A \geq M_A$ then
        break.
    end
end
Output: Discriminative detectors for instance $A$: $\mathcal{D}_A$.
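
The selection loop of Algorithm 1 amounts to keeping the candidates with the largest $P_t$. A minimal sketch, assuming the candidate detectors and their purities have already been computed (the function name and the string placeholders standing in for detectors are hypothetical):

```python
def select_detectors(detectors, purities, max_detectors):
    """Keep the max_detectors candidates with the highest purity P_t."""
    # Indices of the candidates, sorted by decreasing purity.
    order = sorted(range(len(detectors)), key=lambda t: purities[t], reverse=True)
    return [detectors[t] for t in order[:max_detectors]]

# Four candidate detectors (placeholders) and their purities.
candidates = ["d0", "d1", "d2", "d3"]
purities = [0.2, 0.9, 0.5, 0.7]
selected = select_detectors(candidates, purities, max_detectors=2)
```

Slicing the sorted index list caps the selection at exactly `max_detectors` entries, which plays the role of the break condition in the second loop.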


C. Template detector set

As the detectors are discovered from every action instance,
the size of the detector set grows with the number of training
instances, which leads to a very high-dimensional action
representation and makes the computation intractable. On the
other hand, the above method might select similar distinctive
detectors multiple times, resulting in a highly redundant
detector set. To control the size of the detector set and remove
the redundancy of the candidate detector set, we perform spectral
clustering on the candidate detectors and then select one detector
from each cluster to form the final detector set used for trajectorylet
encoding. To build the affinity graph for spectral clustering,
we need to specify the similarity measure between two
detectors. Here we measure this similarity by considering the
“active detection scores” of two detectors, which refer to the
detection scores with positive values. We evaluate it by first
calculating detection scores on $N$ sampled trajectorylets and
setting negative detection scores to zero. This process gives
an $N$-dimensional active detection score vector $\mathbf{r}_d$ for each
detector, and the similarity between two detectors is measured
as follows:

$$q_{dd'} = \frac{\mathbf{r}_d^T \mathbf{r}_{d'}}{\|\mathbf{r}_d\| \cdot \|\mathbf{r}_{d'}\|}, \tag{6}$$

where $\|\cdot\|$ represents the $\ell_2$ norm, and $\mathbf{r}_d$ and $\mathbf{r}_{d'}$ denote the
active detection score vectors of the two compared detectors.
The value $q_{dd'}$ measures the similarity between two detectors
and is used to build the affinity matrix $Q$ for the detector set
$\mathcal{D}$, that is, $Q = [q_{dd'}]_{d,d'=1,\dots,D}$. We apply spectral clustering
to $Q$ and obtain $K < D$ clusters of detectors. The detectors
within the same cluster fire on similar trajectorylets. From
each cluster, we select a representative detector that produces
the highest score on the sampled trajectorylets. In practice,
given a sufficiently large $K$, the collection of representative
detectors can cover all discriminative trajectorylets. We call
this collection the template trajectorylet detector set.

1 In order to measure the scores on the same scale, we adjust the trained
parameters to unit norm before computing the scores.
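
The redundancy-removal step of Section III-C can be sketched as follows. The affinity computation follows the cosine similarity of active detection scores directly; the spectral clustering itself is not reproduced, so cluster assignments are taken as given. How "the highest score" is aggregated over the sampled trajectorylets is not spelled out in the text, so the use of the summed active score in `pick_representatives` is an assumption, as are both function names.

```python
import numpy as np

def active_score_affinity(scores):
    """scores: (D, N) detection scores of D detectors on N sampled trajectorylets."""
    active = np.maximum(np.asarray(scores, dtype=float), 0.0)   # negative scores set to zero
    norms = np.linalg.norm(active, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0          # a detector that never fires gets zero similarity
    unit = active / norms
    return unit @ unit.T               # Q[d, d'] = cosine similarity of active score vectors

def pick_representatives(scores, cluster_ids):
    """Per cluster, pick the detector with the largest summed active score (assumed criterion)."""
    totals = np.maximum(np.asarray(scores, dtype=float), 0.0).sum(axis=1)
    cluster_ids = np.asarray(cluster_ids)
    reps = {}
    for k in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == k)
        reps[int(k)] = int(members[np.argmax(totals[members])])
    return reps

# Three detectors scored on three samples; detectors 0 and 1 fire on the same sample.
scores = np.array([[1.0, 0.0, -1.0],
                   [2.0, 0.0, -3.0],
                   [-1.0, 0.0, 4.0]])
Q = active_score_affinity(scores)
reps = pick_representatives(scores, cluster_ids=[0, 0, 1])
```

The matrix `Q` is the kind of precomputed affinity that an off-the-shelf spectral clustering routine would take as input.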


IV. Global descriptor and classification 

Fig. 4 - Overview of our feature learning framework.

For the detectors in the template detector set, we evaluate
their detection scores on each trajectorylet and max-pool those
detection scores to obtain the action representation. Formally,
let $\mathbf{x}_i^j \in \mathbb{R}^n$ be the $j$-th trajectorylet of the $i$-th action, and
$(\mathbf{w}_k, b_k)$ be the $k$-th detector in the template detector set. We
define the action representation for the $i$-th action,
$\Phi(\mathbf{x}_i) = [\Phi_k(\mathbf{x}_i)]_{k=1,\dots,K}$, as

$$\Phi_k(\mathbf{x}_i) = \max_j (\mathbf{w}_k^T \mathbf{x}_i^j + b_k), \quad k = 1, \dots, K. \tag{7}$$

We use a one-versus-all SVM to classify actions among the
$C$ action classes $y_i \in \{1, \dots, C\}$.
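
The max-pooled encoding of Eq. (7) reduces to one matrix product followed by a column-wise maximum. A minimal sketch, with a hypothetical function name and toy detectors:

```python
import numpy as np

def max_pool_encoding(trajectorylets, W, b):
    """trajectorylets: (T, n); W: (K, n) detector weights; b: (K,) biases.

    Returns Phi in R^K, the best response of each detector over the action.
    """
    scores = np.asarray(trajectorylets, dtype=float) @ np.asarray(W, dtype=float).T + b
    return scores.max(axis=0)          # maximum over the T trajectorylets, per detector

# Three 2-D trajectorylets and K = 2 toy detectors.
X = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([0.0, -1.0])
phi = max_pool_encoding(X, W, b)
```

The output length depends only on the number of detectors $K$, so sequences of different lengths map to vectors of the same size.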

The learned feature mapping $\Phi(\cdot)$ governed by the template
detector set serves as a global descriptor of the action instance.
It maps temporally continuous trajectorylets into a higher-level
representation. Moreover, $\Phi(\cdot)$ can map not only a complete
action sequence but also a temporal sub-sequence.
This allows us to build a temporal pyramid representation of
the action instance. For a 3-level temporal pyramid, the
sub-sequences are $F^{(p)}$, $p = 1, \dots, 7$, and the $k$-th dimension of
the sub-feature $\Phi^{(p)}(\mathbf{x}_i)$ for sub-sequence $p$ is

$$\Phi_k^{(p)}(\mathbf{x}_i) = \max_{j \in F^{(p)}} (\mathbf{w}_k^T \mathbf{x}_i^j + b_k). \tag{8}$$

The concatenation $\Psi(\cdot) = [\Phi^{(1)}(\cdot)^T, \dots, \Phi^{(7)}(\cdot)^T]^T$
incorporates the temporal information of the skeleton sequence.
We are therefore able to train a one-versus-all SVM with this
feature that takes into account the global temporal information
of the whole action sequence.
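
A temporal pyramid with 3 levels yields $1 + 2 + 4 = 7$ sub-sequences, matching $p = 1, \dots, 7$ above. The sketch below assumes that level $\ell$ splits the sequence into $2^{\ell}$ contiguous, roughly equal parts; this splitting rule and the function name are assumptions, since the text does not specify how the sub-sequences are cut.

```python
import numpy as np

def temporal_pyramid_encoding(trajectorylets, W, b, levels=3):
    """Concatenate max-pooled detector scores over 1 + 2 + ... + 2**(levels-1) segments."""
    X = np.asarray(trajectorylets, dtype=float)
    scores = X @ np.asarray(W, dtype=float).T + b        # (T, K) detection scores
    T = scores.shape[0]
    if T < 2 ** (levels - 1):
        raise ValueError("sequence too short for the requested pyramid depth")
    pooled = []
    for level in range(levels):
        parts = 2 ** level
        bounds = np.linspace(0, T, parts + 1).astype(int)  # segment boundaries at this level
        for start, stop in zip(bounds[:-1], bounds[1:]):
            pooled.append(scores[start:stop].max(axis=0))  # Eq. (8) on one sub-sequence
    return np.concatenate(pooled)

# Four 1-D trajectorylets and a single identity detector.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
psi = temporal_pyramid_encoding(X, W=np.array([[1.0]]), b=np.array([0.0]))
```

With $K$ detectors the resulting feature has $7K$ dimensions for a 3-level pyramid.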

V. Experiments 

We organize the experimental evaluation in four parts. We
first compare our proposed method against other state-of-the-art
methods on two standard datasets obtained from the Kinect
sensor. Then we analyze the performance of our method under
different parameter settings. Since our method consists of two
modules, the trajectorylet descriptor and the middle-level feature
representation based on template detector learning, we conduct
two experiments to separately evaluate their impact on the
classification performance. To examine the first module, we
compare our descriptor against the descriptor of [31], which
is most closely related to our trajectorylet descriptor, while keeping the
other settings of the recognition system the same. We also
compare our descriptor with several alternative variants.
To examine the second module, we compare our method with
alternatives constructed from three state-of-the-art
middle-level feature representation methods: VLAD [7],
LSC [9], and LLC [26].


AS1                    AS2              AS3
Horizontal arm wave    High arm wave    High throw
Hammer                 Hand catch       Forward kick
Forward punch          Draw x           Side kick
High throw             Draw tick        Jogging
Hand clap              Draw circle      Tennis swing
Bend                   Two hand wave    Tennis serve
Tennis serve           Forward kick     Golf swing
Pickup & throw         Side boxing      Pickup & throw

TABLE I - The classes in the three action subsets of the MSR Action3D
dataset.


Protocol of [8]      AS1     AS2     AS3      Average
3DBag [8]            72.9    71.9    79.2     74.7
HO3DJ [29]           88.0    85.5    63.5     79.0
EigenJoints [30]     74.5    76.1    96.4     82.3
HOD                  92.4    90.1    91.4     91.2
Lie Group [22]       95.3    83.9    98.2     92.5
EJS [1]              91.6    90.8    97.3     93.2
Moving Pose [31]²    96.4    91.6    99.1     95.7
Ours                 96.4    97.5    100.0    97.9

TABLE II - Results on 3 subsets of the MSR Action3D dataset.


Implementation details: The ESVMs are implemented with
liblinear [4], which produces about 5 candidate detectors
per second on an Intel Core i7 CPU at 3.40GHz. We set
the regularization parameters to $\lambda_1 = 10$ and $\lambda_2 = 0.01$
for all ESVMs. There are on average 9,000 and 30,000
local descriptors in the negative sets for MSR Action3D
and MSR DailyActivity3D respectively. The dimensionality
of the trajectorylets is reduced to 50% of the original by PCA.
As the testing data will not be known in advance, the PCA
coefficients $\mu$ and covariance matrix are learned from the
training data only. Unless indicated otherwise, the length of
the trajectorylet descriptor is set to $L = 5$. The regularization
parameter for the final one-versus-all SVM is determined
by five-fold cross-validation. We apply a 3-level temporal
pyramid on MSR DailyActivity3D only, because it contains
complex actions which involve several sub-actions and the
long-range temporal information can be useful in such a case.

A. MSR Action3D 

2 We use the code of [31] to obtain this result, as the original work did not
report results according to the protocol of [8].

Protocol of [25]                  Accuracy
Recurrent Neural Network [11]     42.5
Dynamic Temporal Warping [14]     54.0
Canonical Poses [3]               65.7
DBM+HMM [27]                      82.0
JAS (skeleton data only) [16]     83.5
Actionlet Ensemble [25]           88.2
HON4D [17]                        88.9
Lie Group [22]                    89.5
LDS [2]³                          90.0
Pose based [23]                   90.2
Moving Pose [31]                  91.7
Ours                              95.9

TABLE III - Results on the entire MSR Action3D dataset.


The MSR Action3D dataset consists of human actions
expressed with skeletons composed of 20 3D body joint
positions in each frame. The 20 joints are connected by 19
limbs. There are 20 action classes performed by 10 subjects
2 or 3 times each, making up 567 action instances. Each action
instance contains a temporal sequence of a moving skeleton,
usually 30-50 frames long. As in [25] and m- we drop 10
instances because they contain erroneous data. The experimental
setup is that of a cross-subject test [8], i.e. instances of half
of the subjects are used for training and instances of the other
half of the subjects are used for testing. We construct $H_t$ with the top
responding $N_A = 50$ trajectorylets, and select the $M_A = 10$ best
detectors for each training instance. We use the clustering
method of Section III-C to obtain the template trajectorylet
detector set. The final number of template detectors is set to
$K = 500$.
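
A cross-subject split can be sketched as below. The tuple layout of an instance and the particular choice of training subjects are assumptions made for the example; the text only states that half of the subjects are used for training and the other half for testing.

```python
def cross_subject_split(instances, train_subjects):
    """instances: list of (subject_id, action_label, sequence) tuples."""
    train = [inst for inst in instances if inst[0] in train_subjects]
    test = [inst for inst in instances if inst[0] not in train_subjects]
    return train, test

# Toy data: 10 subjects, 2 actions each; sequences are left as None placeholders.
instances = [(s, a, None) for s in range(1, 11) for a in ("wave", "clap")]
# Hypothetical choice of training subjects (odd ids); not taken from the text.
train, test = cross_subject_split(instances, train_subjects={1, 3, 5, 7, 9})
```

Splitting by subject rather than by instance ensures that no performer appears in both the training and the testing set.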


In Table II we compare our approach with other state-of-the-art
methods using the protocol of [8], by which the
20 action classes are grouped into 3 action subsets AS1,
AS2, and AS3. Training and testing are performed on each
action subset separately. AS1 and AS2 group actions with similar
movements while AS3 groups complex actions. The action
classes of each action subset are listed in Table I. On average,
our proposed method is more accurate than all other methods.
On AS2, all other methods achieve only moderate accuracy, whereas
our method outperforms the second best by 5.9 percentage points
(97.5% versus 91.6%).
Note that on AS3, our method achieves perfect recognition.


In Table III the more challenging protocol of [25] is used.
Here the model is trained and tested over all 20 action classes.
The results show that our method still obtains a highly accurate
recognition rate, outperforming the best previous result
by a margin of 4.2 percentage points (95.9% versus 91.7%). The
confusion matrix of our method
on this dataset under the second protocol is displayed in
Figure 5, where 16 of 20 action classes are perfectly classified.
The only highly misclassified class is hammer, because its
distinctive pattern involves human-object interaction, which is
not captured by the skeleton data.


3 It should be noted that the result of [2] here is not obtained under the same
setting as ours. That approach selected a subset of 17 actions performed by
8 subjects, 5 for training and 3 for testing, consisting of 379 action instances
in total.


B. MSR DailyActivity3D

In MSR DailyActivity3D, there are 16 action classes performed
by 10 subjects twice, making up 320 action instances.
Each subject performs an action class in two variants (e.g.
sitting versus standing, or in front of versus behind an object).
This dataset has longer sequences, usually 100-300 frames.
We again follow the cross-subject test in the protocol of [25] and
0, where training and testing are conducted over all action
classes. Because this dataset contains more local information
than MSR Action3D, we construct $H_t$ with the top responding
$N_A = 50$ trajectorylets, select the $M_A = 15$ best detectors for
each training instance, and reduce the final number of clustered
detectors to $K = 500$.


Methods                                         Accuracy
Dynamic Temporal Warping [14]                   54.0
Actionlet Ensemble (skeleton data only) [25]    68.0
Moving Pose [31]⁴                               70.6
Ours                                            75.0

TABLE IV - Results on the MSR DailyActivity3D dataset.


We compare our approach with other state-of-the-art methods
in Table IV. As the purpose of this experiment is to
address skeleton-based action recognition, some of the best reported
results ED’ © on this dataset, which use additional RGB-D
data, are not comparable to our method, and therefore we
cite the result of [25] using only skeleton data. Although
MSR DailyActivity3D shares the same data structure as
MSR Action3D, it is much more challenging because: 1) the
activities are complex combinations of multiple sub-actions,
2) human-object interaction information is not available in
skeleton data, 3) partial occlusion by interacting objects causes
the skeleton data to be highly noisy. Nevertheless, the results
show that our approach still outperforms all other state-of-the-art
methods. As shown in Figure 6, most of the poorly
classified actions involve interaction with objects, such as
read book, call cellphone, and use laptop. On the other hand,
non-interactive action classes like cheer up, walk, and sit down
are recognized with high accuracy. This demonstrates that our
method is able to capture distinctive patterns of actions in
terms of “movement”, but may be confused if some actions
share similar “movement” patterns despite the presence of
different interacting objects, because these objects are not described in
the skeleton data.


C. Parameter analysis 

In this section we analyse how the parameter settings
affect the performance. Using the same protocol of [25], we
provide results on the MSR Action3D dataset for other parameter
settings. Figure 7 illustrates the performance of our method
as $K$ ranges over {25, 50, 100, 200, 300, ... , 1000}, while
keeping $N_A = 50$ and $M_A = 10$. When we set the size of the
detector set to more than 500, the results tend to converge to

4 Although the reported result in [31] is 73.8%, we never achieved this
accuracy with their code due to environmental factors. For a fair comparison,
we used the result 70.6%, which is the best performance under the same
environment and setting as our approach.

Fig. 5 - Confusion matrix of our approach on the MSR Action3D dataset: except for the hammer class, all other action classes are classified with more
than 80% accuracy. 16 out of 20 action classes are perfectly classified.



Na \ Ma      5     10     15     20     30     50 
5         91.7      -      -      -      -      - 
10        92.7   93.4      -      -      -      - 
20        93.1   93.1   94.1   94.8      -      - 
30        94.8   95.2   94.1   93.8   94.8      - 
50        95.5   95.9   95.9   94.8   94.8   94.2 

TABLE V - Results (accuracy, %) from different pairs of Ma (columns) and Na (rows) on MSR 
Action3D: we can obtain the best performance from multiple choices. 



Na \ Ma      5     10     15     20     30     50 
5         68.7      -      -      -      -      - 
10        68.1   69.4      -      -      -      - 
20        68.7   71.2   70.0   69.4      -      - 
30        70.0   73.1   73.8   71.2   71.2      - 
50        73.1   74.3   75.0   75.0   74.3   71.9 

TABLE VI - Results (accuracy, %) from different pairs of Ma (columns) and Na (rows) on MSR 
DailyActivity3D. 


a value above 94.5%. Table V presents the results of choosing 
different pairs of Ma and Na while keeping K = 500. 

For the MSR DailyActivity3D dataset, Figure 8 illustrates 
the performance of our method as K ranges over {25, 50, 
100, 200, 300, ... , 1000}, while keeping Na = 50 and Ma = 
15. When K is set to more than 500, the results become stable. 
The effect of choosing different pairs of Ma and Na is listed 
in Table VI. When Ma is large enough, the variation in the results 
becomes small. It can be observed that, on both datasets, 
multiple parameter choices produce the 
best result, which supports the robustness of our approach. 
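
To make the observation about multiple optimal settings concrete, the following Python sketch stores the MSR Action3D accuracies of Table V (K = 500) in a dictionary and lists every parameter pair that attains the maximum. It assumes that the table rows index Na and the columns index Ma, which agrees with the setting Na = 50, Ma = 10 used above. The names `TABLE_V` and `best_settings` are illustrative only and are not part of our implementation.

```python
# Accuracies (%) from Table V, keyed by (Na, Ma).
# Assumption: table rows index Na and columns index Ma.
TABLE_V = {
    (5, 5): 91.7,
    (10, 5): 92.7, (10, 10): 93.4,
    (20, 5): 93.1, (20, 10): 93.1, (20, 15): 94.1, (20, 20): 94.8,
    (30, 5): 94.8, (30, 10): 95.2, (30, 15): 94.1, (30, 20): 93.8,
    (30, 30): 94.8,
    (50, 5): 95.5, (50, 10): 95.9, (50, 15): 95.9, (50, 20): 94.8,
    (50, 30): 94.8, (50, 50): 94.2,
}


def best_settings(table):
    """Return the top accuracy and every (Na, Ma) pair that reaches it."""
    top = max(table.values())
    pairs = sorted(pair for pair, acc in table.items() if acc == top)
    return top, pairs


top, pairs = best_settings(TABLE_V)
print(top, pairs)
```

On the Table V values, two settings with Na = 50 share the highest accuracy, which is the sense in which several parameter choices reach the best result.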

Table VII shows the results under different temporal pyramid 
settings for the two datasets. A typical 3-level pyramid is 
the best choice for MSR DailyActivity3D, as lower-level pyramids 
fail to capture the temporal information while higher-level 


TP level            1      2      3      4 
Action           95.9   92.4   89.7    N/A 
DailyActivity    66.3   70.6   75.0   68.8 

TABLE VII - Results (accuracy, %) obtained from different temporal pyramid (TP) levels on 
MSR Action3D and MSR DailyActivity3D datasets. 


ones bring too much noise. On the other hand, when a temporal 
pyramid is applied to MSR Action3D, the performance 
worsens. 
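
As a rough illustration of what a temporal pyramid does to the pooled representation, the sketch below pools per-trajectorylet encodings over progressively finer temporal segments and concatenates the results. The details are assumptions made for the example rather than a description of our implementation: level l is taken to split the sequence into 2^(l-1) contiguous segments, max pooling is used as the pooling operator, and the function names are hypothetical.

```python
def max_pool(encodings):
    """Element-wise max over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*encodings)]


def temporal_pyramid(encodings, levels):
    """Concatenate pooled vectors from pyramid levels 1..levels.

    Assumption: level l splits the sequence into 2**(l-1) contiguous
    segments of near-equal length; max pooling is an illustrative choice.
    """
    n = len(encodings)
    pooled = []
    for level in range(1, levels + 1):
        parts = 2 ** (level - 1)
        for p in range(parts):
            start = p * n // parts
            end = (p + 1) * n // parts
            segment = encodings[start:end]
            if not segment:
                raise ValueError("sequence too short for this pyramid level")
            pooled.extend(max_pool(segment))
    return pooled


# Four trajectorylet encodings with two detector responses each (toy values).
codes = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.3], [0.5, 0.7]]
pyramid = temporal_pyramid(codes, levels=2)
print(pyramid)
```

Under these assumptions an n-level pyramid multiplies the representation length by 2^n - 1, while each fine segment is pooled from fewer trajectorylets.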

D. Power of local trajectorylet descriptor 

The moving pose descriptor proposed in [31] captures local 
information at the frame level of human skeleton actions. Our 
trajectorylet can be seen as a natural extension of it in the sense 
that we extend the dynamic information from frame-level 

Fig. 8 - Recognition accuracy obtained from varying K on the MSR 
DailyActivity3D dataset: when K > 400 the results become stable. 


Fig. 6 - Confusion matrix of our approach on the MSR DailyActivity3D 
dataset: although this is a challenging dataset for skeleton-based action 
recognition, 11 out of 16 classes are classified with more than 70% accuracy. 



Fig. 7 - Recognition accuracies obtained from varying K on the MSR 
Action3D dataset: when K > 500 the results become stable. 


We also test the effect of using different components of 
the trajectorylet descriptor. In our experiment, we examine 
the performance of the individual components, namely 
static pose x_0, relative joint displacement x_1, and velocity x_2, 
as well as their combinations. We further define an acceleration 
component analogous to (2) and (3): 

$$ x_3^{t_0} = \left[ (\Delta^3 j^{t_0+3})^T, \cdots, (\Delta^3 j^{t_0+L-1})^T \right]^T \in \mathbb{R}^{(L-3) \times 3J}, \qquad (9) $$

$$ \Delta^3 j^{t_0+i} = \Delta^2 j^{t_0+i} - \Delta^2 j^{t_0+i-1}, \quad i = 3, \cdots, L-1. $$

The results of varying settings of a trajectorylet with L = 5 
are listed in Table IX. We find that the dynamic components 
x_1 and x_2 alone do not show promising results. However, 
when combined with the static component x_0, the performance is significantly 
improved. Table IX also shows that the additional acceleration 
component in (9) does not improve the performance. 
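
As a rough illustration of the acceleration component, the sketch below takes the velocity-like terms Δ² j of one trajectorylet, forms the difference between each term and its predecessor, and stacks the differences into a single vector. The input list `delta2` and the function names are hypothetical, and the code uses plain Python lists rather than our actual implementation.

```python
def differences(vectors):
    """Return consecutive differences v[i] - v[i-1] of a list of vectors."""
    return [
        [a - b for a, b in zip(curr, prev)]
        for prev, curr in zip(vectors, vectors[1:])
    ]


def acceleration_component(delta2):
    """Stack the differences of consecutive Delta^2 vectors into one list.

    `delta2` is a hypothetical input: the Delta^2 j vectors of one
    trajectorylet, ordered in time, each of length 3J.
    """
    return [value for vec in differences(delta2) for value in vec]


# Toy example with J = 1 joint (3 coordinates) and three Delta^2 vectors.
delta2 = [[1.0, 0.0, 0.0], [3.0, 1.0, 0.0], [6.0, 1.0, 2.0]]
x3 = acceleration_component(delta2)
print(x3)
```

Each differencing step shortens the sequence by one term, so the acceleration component is built from fewer frames than the velocity component for the same trajectorylet length L.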


Descriptors         MSR Action    MSR DailyActivity 
Moving Pose [31]          91.7                 71.3 
Ours (L = 3)              93.1                 72.5 
Ours (L = 5)              95.9                 75.0 
Ours (L = 7)              95.9                 73.1 

TABLE VIII - Comparison of using different descriptors (accuracy, %). 


range to a longer temporal range. In order to demonstrate the 
power of our descriptor, we now apply our template detector 
learning framework to the moving pose descriptor and compare 
its performance with that of the trajectorylet. 

In order to evaluate the effect of varying L on performance, 
we vary the length L of our trajectorylet over {3, 5, 7}. 
Table VIII shows that, using the same detector learning and 
classification approach, trajectorylets achieve better results on 
both datasets for all tested values of L. As seen, this extension 
of the moving pose descriptor is superior to the original design. 
It is worth noting that performance does not necessarily 
improve as the length of the trajectorylet increases. A moderate 
trajectorylet length (L = 5) leads to the best performance. 


Component                 MSR Action    MSR DailyActivity 
x_0                             92.4                 72.5 
x_1                             91.7                 50.3 
x_2                             90.3                 42.5 
(x_0, x_1)                      93.8                 73.1 
(x_0, x_1, x_2)                 95.9                 75.0 
(x_0, x_1, x_2, x_3)            95.9                 74.3 

TABLE IX - Comparison of using different components of the 
trajectorylet (L = 5) (accuracy, %). 


E. Power of template detector learning 

Our method generates an action representation from a learned 
detector set of discriminative trajectorylets. In this section, 

Fig. 9 - Some examples of trajectorylets responding to the template detector set of MSR Action3D. The black curves represent the current trajectorylets of the red skeleton. 
Our approach clearly identifies discriminative patterns of movement. 


Method       MSR Action    MSR DailyActivity 
VLAD [7]           83.1                 51.9 
LLC [26]           90.7                 65.6 
LSC [9]            92.1                 66.9 
Ours               95.9                 75.0 

TABLE X - Comparison of feature learning methods (accuracy, %). 


we compare this method with three state-of-the-art bag-of-feature 
techniques that learn a middle-level feature from the 
same local trajectorylet feature: VLAD (vector of locally 
aggregated descriptors) [7], LLC (locality-constrained linear 
coding) [26], and LSC (localized soft-assignment coding) [9]. 

We train a codebook of the same size K = 128 with k-means 
for all three methods, and set the neighbourhood size 
of codewords to k = 10 for LSC and LLC. The results 


listed in Table X show that, for the task of action recognition, 
our proposed feature learning framework produces the most 
discriminative action representation compared with the 
state-of-the-art methods. Figure 9 illustrates some trajectorylets that fire 
on the template detector set of MSR Action3D. It is clear that 
they show representative patterns for the corresponding action 
classes. 
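
For readers unfamiliar with the bag-of-feature baselines, the sketch below shows the core of VLAD encoding [7] on toy data: each local descriptor is assigned to its nearest codeword, the residuals are accumulated per codeword, and the concatenated result is L2-normalised. The two-codeword codebook is set by hand here purely for illustration (in the experiments it is learned with k-means), and the function names are hypothetical.

```python
def nearest(codebook, x):
    """Index of the codeword closest to x in squared Euclidean distance."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codebook]
    return dists.index(min(dists))


def vlad(descriptors, codebook):
    """Sum residuals x - c per nearest codeword, flatten, L2-normalise."""
    dim = len(codebook[0])
    residuals = [[0.0] * dim for _ in codebook]
    for x in descriptors:
        k = nearest(codebook, x)
        for d in range(dim):
            residuals[k][d] += x[d] - codebook[k][d]
    flat = [v for row in residuals for v in row]
    norm = sum(v * v for v in flat) ** 0.5
    return [v / norm for v in flat] if norm > 0 else flat


# Toy codebook (hand-set, not learned) and three 2-D local descriptors.
codebook = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [2.0, 0.0], [10.0, 14.0]]
encoding = vlad(descriptors, codebook)
print(encoding)
```

Unlike this unsupervised aggregation, our detector set is selected discriminatively, which is the difference that Table X evaluates.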

VI. Conclusion 

This work describes an effective skeleton-based action recognition 
approach that achieves high accuracy on the relevant benchmark 
datasets. Two factors are key to this performance. First, we 
propose the trajectorylet, a novel local descriptor that captures 
static and dynamic information in a short interval of joint 
trajectories. Second, we devise a novel framework that generates a 
robust and discriminative representation for action instances 
by learning a set of distinctive trajectorylet detectors. On 
two benchmark datasets acquired from the Kinect sensor, our 
method outperforms, to our knowledge, all existing approaches 
by a significant margin. We also separately demonstrate the 
validity of our local descriptors and template detector learning 
method. To further expand our framework, we plan to incorporate 
local temporal information to enable real-time detection, 
and to investigate RGB data to study the involvement 
of human-object interactions. 

References 

[1] A. A. Chaaraoui, J. R. Padilla-Lopez, P. Climent-Perez, and F. Florez-Revuelta. 
Evolutionary joint selection to improve human action recognition 
with RGB-D devices. Expert Syst. Appl., 41(3):786-794, February 
2014. 

[2] R. Chaudhry, F. Ofli, G. Kurillo, R. Bajcsy, and R. Vidal. 
Bio-inspired dynamic 3d discriminative skeletal features for human action 
recognition. In Proc. Workshops of IEEE Conf. Comp. Vis. Patt. Recogn., 
pages 471-478, June 2013. 

[3] C. Ellis, S. Z. Masood, M. F. Tappen, J. J. Laviola, Jr., and R. Sukthankar. 
Exploring the trade-off between accuracy and observational 
latency in action recognition. Int. J. Computer Vision, 101(3):420-436, 
2013. 

[4] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. 
Liblinear: A library for large linear classification. J. Machine Learning 
Research, 9:1871-1874, 2008. 

[5] M. A. Gowayyed, M. Torki, M. E. Hussein, and M. El-Saban. Histogram 
of oriented displacements (hod): describing trajectories of human joints 
for action recognition. In Proc. Int. Joint Conf. Artificial Intelligence, 
2013. 

[6] J. Han, L. Shao, D. Xu, and J. Shotton. Enhanced computer vision with 
microsoft kinect sensor: A review. IEEE T. Cybernetics, 43(5):1318-1334, 
2013. 

[7] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local 
descriptors into a compact image representation. In Proc. IEEE Conf. 
Comp. Vis. Patt. Recogn., pages 3304-3311, June 2010. 

[8] W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3d 
points. In Proc. Workshops of IEEE Conf. Comp. Vis. Patt. Recogn., 
2010. 

[9] L. Liu, L. Wang, and X. Liu. In defense of soft-assignment coding. In 
Proc. IEEE Int. Conf. Comp. Vis., pages 2486-2493, Nov 2011. 

[10] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-svms 
for object detection and beyond. In Proc. IEEE Int. Conf. Comp. Vis., 
2011. 

[11] J. Martens and I. Sutskever. Learning recurrent neural networks with 
hessian-free optimization. In Proc. Int. Conf. Mach. Learn., pages 1033-1040, 
New York, NY, USA, June 2011. ACM. 

[12] R. Messing, C. Pal, and H. Kautz. Activity recognition using the velocity 
histories of tracked keypoints. In Proc. IEEE Int. Conf. Comp. Vis., pages 
104-111, Sept 2009. 

[13] T. B. Moeslund, A. Hilton, and V. Kruger. A survey of advances in 
vision-based human motion capture and analysis. Comput. Vis. Image 
Underst., 104(2):90-126, 2006. 

[14] M. Muller and T. Roder. Motion templates for automatic classification 
and retrieval of motion capture data. In ACM SIGGRAPH/Eurographics 
Symposium on Computer Animation, pages 137-146, 2006. 

[15] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. Sequence 
of the most informative joints (smij): A new representation for human 
skeletal action recognition. In Proc. Workshops of IEEE Conf. Comp. 
Vis. Patt. Recogn., pages 8-13, June 2012. 

[16] E. Ohn-Bar and M. M. Trivedi. Joint angles similarities and HOG2 for 
action recognition. In Proc. Workshops of IEEE Conf. Comp. Vis. Patt. 
Recogn., 2013. 

[17] O. Oreifej and Z. Liu. HON4D: Histogram of oriented 4d normals for 
activity recognition from depth sequences. In Proc. IEEE Conf. Comp. 
Vis. Patt. Recogn., 2013. 

[18] M. Raptis and L. Sigal. Poselet key-framing: A model for human activity 
recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2650-2657, 
Washington, DC, USA, 2013. IEEE Computer Society. 

[19] Z. Shao and Y. Li. A new descriptor for multiple 3d motion trajectories 
recognition. In Proc. IEEE Int. Conf. Robotics and Automation, pages 
4749-4754, May 2013. 

[20] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, 
A. Kipman, and A. Blake. Real-time human pose recognition in 
parts from single depth images. In Proc. IEEE Conf. Comp. Vis. 
Patt. Recogn., pages 1297-1304, Washington, DC, USA, 2011. IEEE 
Computer Society. 

[21] P. Turaga, R. Chellappa, V. Subrahmanian, and O. Udrea. Machine 
recognition of human activities: A survey. IEEE T. Circuits & Systems 
for Video Technology, 18(11):1473-1488, 2008. 

[22] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition 
by representing 3d skeletons as points in a Lie group. In Proc. IEEE 
Conf. Comp. Vis. Patt. Recogn., pages 588-595, June 2014. 

[23] C. Wang, Y. Wang, and A. Yuille. An approach to pose-based action 
recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 915-922, 
June 2013. 

[24] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by 
dense trajectories. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 
3169-3176, June 2011. 

[25] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for 
action recognition with depth cameras. In Proc. IEEE Conf. Comp. Vis. 
Patt. Recogn., pages 1290-1297, June 2012. 

[26] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality- 
constrained linear coding for image classification. In Proc. IEEE Conf. 
Comp. Vis. Patt. Recogn., pages 3360-3367, June 2010. 

[27] D. Wu and L. Shao. Leveraging hierarchical parametric networks for 
skeletal joints based action segmentation and recognition. In Proc. IEEE 
Conf. Comp. Vis. Patt. Recogn., pages 724-731, June 2014. 

[28] S. Wu, Y. Li, and J. Zhang. A hierarchical motion trajectory signature 
descriptor. In Proc. IEEE Int. Conf. Robotics and Automation, pages 
3070-3075, May 2008. 

[29] L. Xia, C. Chen, and J. Aggarwal. View invariant human action 
recognition using histograms of 3d joints. In Proc. Workshops of IEEE 
Conf. Comp. Vis. Patt. Recogn., pages 20-27. IEEE, 2012. 

[30] X. Yang and Y. Tian. Eigenjoints-based action recognition using 
naive-bayes-nearest-neighbor. In Proc. Workshops of IEEE Conf. Comp. Vis. 
Patt. Recogn., pages 14-19, 2012. 

[31] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The “moving pose”: 
An efficient 3d kinematics descriptor for low-latency action recognition 
and detection. In Proc. IEEE Int. Conf. Comp. Vis., 2013. 

[32] Z. Zhao and A. Elgammal. Information theoretic key frame selection 
for action recognition. In Proc. British Mach. Vis. Conf., 2008. 



